{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial for Run Your Own Task in MT-DNN\n",
    "To run your own task with MT-DNN is actually easy with 3 steps:\n",
    "1. add your task into task define config\n",
    "2. prepare your task data with correct schema\n",
    "3. specify your task name in args for train.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Add Your task into task define config\n",
    "MT-DNN adapts [yaml](https://en.wikipedia.org/wiki/YAML) file as config file format. \n",
    "Here is a piece of example task define config :\n",
    "<pre>cola:\n",
    "  data_format: PremiseOnly\n",
    "  dropout_p: 0.05\n",
    "  enable_san: false\n",
    "  metric_meta:\n",
    "  - ACC\n",
    "  - MCC\n",
    "  loss: CeCriterion\n",
    "  n_class: 2\n",
    "  task_type: Classification</pre>\n",
    "\n",
    "We will take \"mnli\" as example to show you what are these fields and add them step by step."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add task definition for your task\n",
    "\n",
    "<pre>mnli\n",
    "  task_type: Classification\n",
    "  n_class: 3</pre> \n",
    "\n",
    "  speicfy what task type it is in your own task, choose one from types in below:\n",
    "  1. Classification\n",
    "  2. Regression\n",
    "  3. Ranking\n",
    "  4. Span\n",
    "  5. SeqenceLabeling\n",
    "  More details in [data_utils/task_def.py](../data_utils/task_def.py)\n",
    "  \n",
    "Also, specify how many classes in total in your task, under \"n_class\" field.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add data information for your task\n",
    "\n",
    "<pre>mnli\n",
    "  task_type: Classification\n",
    "  n_class: 3\n",
    "  data_format: PremiseAndOneHypothesis \n",
    "  split_names: # optional when your sets are already named as TASK_train/TASK_dev/TASK_test\n",
    "  - train\n",
    "  - matched_dev\n",
    "  - mismatched_dev\n",
    "  - matched_test\n",
    "  - mismatched_test\n",
    "  labels: # optional when your labels are int or float\n",
    "  - contradiction\n",
    "  - neutral\n",
    "  - entailment\n",
    "  </pre> \n",
    "  \n",
    "  choose the correct data format based on your task, currently we support 4 types of data formats, coresponds to different tasks:\n",
    "  1. \"PremiseOnly\" : single text, i.e. premise. Data format is \"id\" \\t \"label\" \\t \"premise\" .\n",
    "  2. \"PremiseAndOneHypothesis\" : two texts, i.e. one premise and one hypothesis. Data format is \"id\" \\t \"label\" \\t \"premise\" \\t \"hypothesis\".\n",
    "  3. \"PremiseAndMultiHypothesis\" : one text as premise and multiple candidates of texts as hypothesis. Data format is \"id\" \\t \"label\" \\t \"premise\" \\t \"hypothesis_1\" \\t \"hypothesis_2\" \\t ... \\t \"hypothesis_n\".\n",
    "  4. \"Seqence\" : sequence tagging. Data format is \"id\" \\t \"label\" \\t \"premise\".\n",
    "  \n",
    "  More details in [data_utils/task_def.py](../data_utils/task_def.py)\n",
    " \n",
    "The code is using surfix to distinguish what type of set it is (\"_train\",\"_dev\" and \"_test\"). So:\n",
    "  1. make sure your train set is named as \"TASK_train\" (replace TASK with your task name)\n",
    "  2. make sure your dev set and test set ends with \"_dev\" and \"_test\".\n",
    "\n",
    "If you prefer using readable labels (text), you can specify what labels are there in your data set, under \"labels\" field."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add hyper-parameters for your task\n",
    "\n",
    "<pre>mnli\n",
    "  task_type: Classification\n",
    "  n_class: 3\n",
    "  data_format: PremiseAndOneHypothesis \n",
    "  split_names: # optional when your sets are already named as TASK_train/TASK_dev/TASK_test\n",
    "  - train\n",
    "  - matched_dev\n",
    "  - mismatched_dev\n",
    "  - matched_test\n",
    "  - mismatched_test\n",
    "  labels: # optional when your labels are int or float\n",
    "  - contradiction\n",
    "  - neutral\n",
    "  - entailment\n",
    "  dropout_p: 0.3\n",
    "  enable_san: true\n",
    "  </pre>\n",
    "\n",
    "  \n",
    "  we support assigning different dropout prob for different task as well, please assign the prob for your task in \"dropout_p\" field;\n",
    "  \n",
    "  set \"true\" if you would like to use Stochastic Answer Networks([SAN](https://www.aclweb.org/anthology/P18-1157.pdf)) for your task."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add metric and loss for your task\n",
    "\n",
    "<pre>mnli\n",
    "  task_type: Classification\n",
    "  n_class: 3\n",
    "  data_format: PremiseAndOneHypothesis \n",
    "  split_names: # optional when your sets are already named as TASK_train/TASK_dev/TASK_test\n",
    "  - train\n",
    "  - matched_dev\n",
    "  - mismatched_dev\n",
    "  - matched_test\n",
    "  - mismatched_test\n",
    "  labels: # optional when your labels are int or float\n",
    "  - contradiction\n",
    "  - neutral\n",
    "  - entailment\n",
    "  dropout_p: 0.3\n",
    "  enable_san: true\n",
    "  metric_meta:\n",
    "  - ACC\n",
    "  loss: CeCriterion\n",
    "  </pre>\n",
    "  \n",
    "  you can choose multiple metrics from : ACC,F1,MCC,Pearson,Spearman,AUC and SeqEval. More details in [data_utils/metrics.py](../data_utils/metrics.py);\n",
    "  \n",
    "  you can choose loss from : \n",
    "    CeCriterion,RegCriterion,RankCeCriterion,SpanCeCriterion and SeqCeCriterion. More details in [data_utils/loss.py](../data_utils/loss.py)\n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Prepare your data in correct format\n",
    "\n",
    "remember what \"data_format\" you set in config? please follow the detailed data format below, which we also mentioned above, to prepare your data:\n",
    "\n",
    "1. \"PremiseOnly\" : single text, i.e. premise. Data format is \"id\" \\t \"label\" \\t \"premise\" .\n",
    "2. \"PremiseAndOneHypothesis\" : two texts, i.e. one premise and one hypothesis. Data format is \"id\" \\t \"label\" \\t \"premise\" \\t \"hypothesis\".\n",
    "3. \"PremiseAndMultiHypothesis\" : one text as premise and multiple candidates of texts as hypothesis. Data format is \"id\" \\t \"label\" \\t \"premise\" \\t \"hypothesis_1\" \\t \"hypothesis_2\" \\t ... \\t \"hypothesis_n\".\n",
    "4. \"Seqence\" : sequence tagging. Data format is \"id\" \\t \"label\" \\t \"premise\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tokenization and Convert to Json\n",
    "\n",
    "the training code reads tokenized data in json format. please use \"prepro_std.py\" to do tokenization and convert your data into json format.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Onboard your task into training!\n",
    "\n",
    "1. Add your piece of config into overall config for all tasks\n",
    "2. append your task and test_set prefix in train.py args : \"--train_datasets EXISTING_TASKS,mnli --test_datasets EXISTING_TASK_TEST_SETS,mnli_mismatched,mnli_matched\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}